feat: Add new component CSVDocumentSplitter to recursively split CSV documents #8815
Pull Request Test Coverage Report for Build 13239732856
💛 - Coveralls
Hey @sjrl, I've tested your implementation a little bit.

Performance:
On average, this is a 2x improvement in speed over the previous BFS approach. 💪 Good job.

Correctness

def test_complex_bridging(self) -> None:
    """
    Rows bridging from left to right => BFS splits each row into left block & right block.
    """
    csv_data = """ID,LeftVal,,,RightVal,Extra
1,Hello,,,World,Joined
2,StillLeft,,,StillRight,Bridge
A,B,,,C,D
E,F,,,G,H
"""
    splitter = CSVDocumentSplitterV2(row_split_threshold=1, column_split_threshold=1)
    result = splitter.run([Document(content=csv_data)])
    docs = result["documents"]
    assert len(docs) == 4
    block_texts = [doc.content for doc in docs]
    assert any("ID,LeftVal" in text for text in block_texts)
    assert any("Hello" in text for text in block_texts)
    assert any("World,Joined" in text for text in block_texts)
    assert any("StillLeft" in text for text in block_texts)
    assert any("StillRight,Bridge" in text for text in block_texts)
    assert any("A,B" in text for text in block_texts)
    assert any("C,D" in text for text in block_texts)
    assert any("E,F" in text for text in block_texts)
    assert any("G,H" in text for text in block_texts)

In this scenario, the expected output should yield 4 documents rather than 2.

Usability
Thanks for taking a look @alex-stoica! I’ll make the changes you suggest and add the test cases for this.
LGTM - we should now also have an issue to keep track of the documentation for this new component.
Opened issue here: #8835
Related Issues
Proposed Changes:
Alternative approach as discussed in this PR: #8795
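For readers who want a feel for what recursively splitting a CSV can look like, here is a minimal sketch in plain Python. It is not the implementation from this PR. The function names (`split_grid`, `split_csv`), the sample data, and the reading of the two thresholds as "minimum number of consecutive empty rows or columns that count as a separator" are all assumptions made for illustration.

```python
import csv
import io


def _non_empty_ranges(empty, threshold):
    """Return (start, end) ranges left after cutting at runs of empty lines.

    `empty[i]` is True when row/column i is empty. Only runs of at least
    `threshold` consecutive empty lines are treated as separators.
    """
    ranges, start, i, n = [], 0, 0, len(empty)
    while i < n:
        if empty[i]:
            j = i
            while j < n and empty[j]:
                j += 1
            if j - i >= threshold:
                if i > start:
                    ranges.append((start, i))
                start = j
            i = j
        else:
            i += 1
    if start < n:
        ranges.append((start, n))
    return ranges


def split_grid(grid, row_threshold=1, col_threshold=1):
    """Recursively split a rectangular grid of strings into blocks."""
    if not grid or not grid[0]:
        return []
    # First try to cut along empty rows.
    row_empty = [all(cell == "" for cell in row) for row in grid]
    row_ranges = _non_empty_ranges(row_empty, row_threshold)
    if row_ranges != [(0, len(grid))]:
        blocks = []
        for a, b in row_ranges:
            blocks.extend(split_grid(grid[a:b], row_threshold, col_threshold))
        return blocks
    # No row cut possible: try to cut along empty columns.
    width = len(grid[0])
    col_empty = [all(row[c] == "" for row in grid) for c in range(width)]
    col_ranges = _non_empty_ranges(col_empty, col_threshold)
    if col_ranges != [(0, width)]:
        blocks = []
        for a, b in col_ranges:
            sub = [row[a:b] for row in grid]
            blocks.extend(split_grid(sub, row_threshold, col_threshold))
        return blocks
    return [grid]


def split_csv(text, row_threshold=1, col_threshold=1):
    """Parse CSV text, split it, and return each block as CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(r) for r in rows), default=0)
    grid = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows
    out = []
    for block in split_grid(grid, row_threshold, col_threshold):
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(block)
        out.append(buf.getvalue())
    return out


# Hypothetical sample: an empty row and an empty column act as separators.
SAMPLE = "A,B,,X,Y\n1,2,,3,4\n,,,,\nC,D,,,\n5,6,,,\n"
blocks = split_csv(SAMPLE)
```

The sketch cuts on empty rows first and only then on empty columns, recursing into each piece until no further cut applies. That ordering is a design choice of the sketch, not something stated in this PR.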
How did you test it?
Added unit tests.
Notes for the reviewer
Checklist
Conventional commit types for the PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:, and added ! in case the PR includes breaking changes.